R and DuckDB Setup
library(readr)
library(tidyverse)
library(janitor)
library(gt)
library(reticulate)
library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)
drv <- duckdb()
con <- dbConnect(drv)Here is how we loaded and cleaned the data, and the problems that arose in the process.
library(readr)
library(tidyverse)
library(janitor)
library(gt)
library(reticulate)
library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)
drv <- duckdb()
con <- dbConnect(drv)# py_install("pandas")
import pandas as pd
import numpy as np
import great_tables as gt #this makes the python tables look nice
import warnings #this is to silence a warning about using .replace later in the code
from great_tables import GT
warnings.simplefilter(action='ignore', category=FutureWarning)These are the tables before cleaning and combining. Lots of unnecessary columns and inconsistent headings between the tables. Another issue to address is the quotations around the song titles on the Billboard chart. The song titles and artist names are also not consistent across tables.
Reading the csv files is pretty simple in R. read_csv reads the csv and the output can be saved. show_col_types = FALSE hides the console output that shows a count of the rows and columns and if the column is a chr, dbl, etc.
spotify_charts <- read_csv("initial_tables/spotify_top_charts_22.csv", show_col_types = FALSE)
spotify_chartstiktok_charts <- read_csv("initial_tables/TikTok_songs_2022.csv", show_col_types = FALSE)
tiktok_chartsbillboard_charts <- read_csv("initial_tables/Billboard_hot_100_year_end_2022.csv", show_col_types = FALSE)
billboard_chartsThis code makes (or replaces) a table from the csv file. SELECT * means that all columns in the csv will be brought over into the table. The tables only show five rows so that they do not take up too much of the page.
CREATE OR REPLACE TABLE spotify_charts AS
SELECT * FROM 'initial_tables/spotify_top_charts_22.csv';SELECT * FROM spotify_charts
LIMIT 5; --only show five rows| uri | artist_names | track_name | peak_rank | weeks_on_chart | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | tempo | time_signature | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| spotify:track:02MWAaffLxlfxAUY7c5dvx | Glass Animals | Heat Waves | 1 | 65 | 0.761 | 0.525 | 11 | -6.900 | 1 | 0.0944 | 0.4400 | 6.70e-06 | 0.0921 | 80.870 | 4 | 238805 |
| spotify:track:5PjdY0CKGZdEuoNab3yDmX | The Kid LAROI, Justin Bieber | STAY (with Justin Bieber) | 1 | 37 | 0.591 | 0.764 | 1 | -5.484 | 1 | 0.0483 | 0.0383 | 0.00e+00 | 0.1030 | 169.928 | 4 | 141806 |
| spotify:track:3IAfUEeaXRX9s9UdKOJrFI | Anitta | Envolver | 3 | 3 | 0.812 | 0.736 | 4 | -5.421 | 0 | 0.0833 | 0.1520 | 2.54e-03 | 0.0914 | 91.993 | 4 | 193806 |
| spotify:track:1HhNoOuqm1a5MXYEgAFl8o | Imagine Dragons, JID, Arcane, League of Legends | Enemy (with JID) - from the series Arcane League of Legends | 3 | 21 | 0.728 | 0.783 | 11 | -4.424 | 0 | 0.2660 | 0.2370 | 0.00e+00 | 0.4340 | 77.011 | 4 | 173381 |
| spotify:track:4fouWK6XVHhzl78KzQ1UjL | GAYLE | abcdefu | 1 | 19 | 0.695 | 0.540 | 4 | -5.692 | 1 | 0.0493 | 0.2990 | 0.00e+00 | 0.3670 | 121.932 | 4 | 168602 |
CREATE OR REPLACE TABLE tiktok_charts AS
SELECT * FROM 'initial_tables/TikTok_songs_2022.csv';SELECT * FROM tiktok_charts
LIMIT 5; --only show five rows| track_name | artist_name | artist_pop | album | track_pop | danceability | energy | loudness | mode | key | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Running Up That Hill (A Deal With God) | Kate Bush | 81 | Hounds Of Love | 95 | 0.629 | 0.547 | -13.123 | 0 | 10 | 0.0550 | 0.7200 | 3.14e-03 | 0.0604 | 0.197 | 108.375 | 4 | 298933 |
| As It Was | Harry Styles | 91 | As It Was | 96 | 0.520 | 0.731 | -5.338 | 0 | 6 | 0.0557 | 0.3420 | 1.01e-03 | 0.3110 | 0.662 | 173.930 | 4 | 167303 |
| Sunroof | Nicky Youre | 73 | Sunroof | 44 | 0.768 | 0.716 | -5.110 | 1 | 10 | 0.0404 | 0.3500 | 0.00e+00 | 0.1500 | 0.841 | 131.430 | 4 | 163026 |
| Heat Waves | Glass Animals | 80 | Dreamland (+ Bonus Levels) | 89 | 0.761 | 0.525 | -6.900 | 1 | 11 | 0.0944 | 0.4400 | 6.70e-06 | 0.0921 | 0.531 | 80.870 | 4 | 238805 |
| About Damn Time | Lizzo | 81 | About Damn Time | 92 | 0.836 | 0.743 | -6.305 | 0 | 10 | 0.0656 | 0.0995 | 0.00e+00 | 0.3350 | 0.722 | 108.966 | 4 | 191822 |
CREATE OR REPLACE TABLE billboard_charts AS
SELECT * FROM 'initial_tables/Billboard_hot_100_year_end_2022.csv';SELECT * FROM billboard_charts
LIMIT 5; --only show five rows| No. | Title | Artist(s) |
|---|---|---|
| 1 | “Heat Waves” | Glass Animals |
| 2 | “As It Was” | Harry Styles |
| 3 | “Stay” | The Kid Laroi and Justin Bieber |
| 4 | “Easy on Me” | Adele |
| 5 | “Shivers” | Ed Sheeran |
Making the tables in Python is very similar to R. pd.read_csv reads the csv file and returns a DataFrame.
spotify_charts = pd.read_csv("initial_tables/spotify_top_charts_22.csv")
spotify_charts uri ... duration_ms
0 spotify:track:02MWAaffLxlfxAUY7c5dvx ... 238805
1 spotify:track:5PjdY0CKGZdEuoNab3yDmX ... 141806
2 spotify:track:3IAfUEeaXRX9s9UdKOJrFI ... 193806
3 spotify:track:1HhNoOuqm1a5MXYEgAFl8o ... 173381
4 spotify:track:4fouWK6XVHhzl78KzQ1UjL ... 168602
.. ... ... ...
641 spotify:track:0So2sgVa8aJiARPl2P29u2 ... 187333
642 spotify:track:1TUuhV75FeOF6UObRsABo2 ... 226975
643 spotify:track:3GbQcjKWT7eVpUfzqwUjxv ... 180666
644 spotify:track:7My9ca9QEIR8MvIOdIrTWA ... 215120
645 spotify:track:4axSuOg3BqsowKjRpj59RU ... 264439
[646 rows x 17 columns]
tiktok_charts = pd.read_csv("initial_tables/TikTok_songs_2022.csv")
tiktok_charts track_name ... duration_ms
0 Running Up That Hill (A Deal With God) ... 298933
1 As It Was ... 167303
2 Sunroof ... 163026
3 Heat Waves ... 238805
4 About Damn Time ... 191822
.. ... ... ...
258 The Less I Know The Better ... 216320
259 Dandelions ... 233720
260 Jimmy Cooks (feat. 21 Savage) ... 218365
261 Good Looking ... 214800
262 INFERNO ... 133134
[263 rows x 18 columns]
billboard_charts = pd.read_csv("initial_tables/Billboard_hot_100_year_end_2022.csv")
billboard_charts No. Title Artist(s)
0 1 "Heat Waves" Glass Animals
1 2 "As It Was" Harry Styles
2 3 "Stay" The Kid Laroi and Justin Bieber
3 4 "Easy on Me" Adele
4 5 "Shivers" Ed Sheeran
.. ... ... ...
95 96 "Flower Shops" Ernest featuring Morgan Wallen
96 97 "To the Moon" Jnr Choi and Sam Tompkins
97 98 "Unholy" Sam Smith and Kim Petras
98 99 "One Mississippi" Kane Brown
99 100 "Circles Around This Town" Maren Morris
[100 rows x 3 columns]
All the tables are messy. Both the Spotify and TikTok tables have columns that are not necessary for this project. To keep the tables simple, we cleaned the data to make all three tables as similar as possible. We renamed all the columns holding the song title to song_title, the artist(s) artist, and the popularity/rank to rank_s/rank_t/rank_b for Spotify, TikTok, and Billboard, respectively. We also included the tempo column on the Spotify and TikTok tables. The quotations are also removed from the song titles in the Billboard chart as well.
For the spotify_charts, we filterd to keep on the the columns containing the song title, artist anme, peak rank, and tempo. We reordered and renamed the columns as follows: song_title, artist, rank_s, and tempo. The rank_s column indicated the highest rank the song appeared on the chart, so a lower number indicated that the song was more popular.
For the Spotify data, we used select() to only keep the relevant columns.
spotify_cleaned <- spotify_charts |>
clean_names() |> #clean the column headers
select(track_name, artist_names, peak_rank, tempo) |> #select only certain columns
rename(song_title = track_name, artist = artist_names, rank_s = peak_rank) |>
arrange(rank_s) #order from most to least popular (lower number = more popular)
spotify_cleanedSimilar to in R, we used SELECT to only keep the relevant columns and used ORDER BY to arrange the rows from most to least popular.
--select only relevant columns and rename them appropriately
CREATE OR REPLACE TABLE spotify_cleaned AS
SELECT track_name AS song_title, artist_names AS artist, peak_rank AS rank_s, tempo
FROM spotify_charts
;
--order from most to least popular using the rank_s column (lower number = more popular)
FROM spotify_cleaned
ORDER BY rank_s ASC| song_title | artist | rank_s | tempo |
|---|---|---|---|
| Heat Waves | Glass Animals | 1 | 80.870 |
| STAY (with Justin Bieber) | The Kid LAROI, Justin Bieber | 1 | 169.928 |
| abcdefu | GAYLE | 1 | 121.932 |
| Easy On Me | Adele | 1 | 141.981 |
| good 4 u | Olivia Rodrigo | 1 | 166.928 |
| Blinding Lights | The Weeknd | 1 | 171.005 |
| MONTERO (Call Me By Your Name) | Lil Nas X | 1 | 178.781 |
| drivers license | Olivia Rodrigo | 1 | 143.875 |
| DÁKITI | Bad Bunny, Jhay Cortez | 1 | 109.928 |
| Beggin’ | Måneskin | 1 | 134.002 |
For the Spotify data, we used .filter to only keep the relevant columns and .sort_values to sort the rows.
spotify_cleaned = (spotify_charts
.filter(items = ['track_name', 'artist_names', 'peak_rank', 'tempo'])
.rename(columns = {'track_name': 'song_title',
'artist_names': 'artist',
'peak_rank': 'rank_s'})
.sort_values('rank_s') # lower number = higher rank
)
spotify_cleaned song_title ... tempo
0 Heat Waves ... 80.870
1 STAY (with Justin Bieber) ... 169.928
4 abcdefu ... 121.932
87 Peaches (feat. Daniel Caesar & Giveon) ... 90.030
93 Sunflower - Spider-Man: Into the Spider-Verse ... 89.960
.. ... ... ...
500 Hold That Heat (feat. Travis Scott) ... 130.045
561 Si Estuviésemos Juntos ... 171.854
310 Heartless ... 87.999
583 Paris to Tokyo ... 146.733
609 Sehnsucht ... 142.090
[646 rows x 4 columns]
Similar to the Spotify dataset, we kept only the relevant columns and renamed them to be the same as the Spotify dataset. The rank_t column indicates the popularity of a song, with a higher value indicating that the song achieved greater popularity.
tiktok_cleaned <- tiktok_charts |>
clean_names() |> #clean the column headers
select(track_name, artist_name, track_pop, tempo) |> #filter for certain columns
rename(song_title = track_name, artist = artist_name, rank_t = track_pop) |>
arrange(desc(rank_t)) #arrange by popularity (higher number = more popular)
tiktok_cleaned--create a cleaned version of the tiktok table and select only relevant columns
CREATE OR REPLACE TABLE tiktok_cleaned AS
SELECT track_name AS song_title, artist_name AS artist, track_pop AS rank_t, tempo
FROM tiktok_charts
;
--order from most to least popular using the rank_t column
FROM tiktok_cleaned
ORDER BY rank_t DESC| song_title | artist | rank_t | tempo |
|---|---|---|---|
| Glimpse of Us | Joji | 97 | 169.914 |
| As It Was | Harry Styles | 96 | 173.930 |
| Running Up That Hill (A Deal With God) | Kate Bush | 95 | 108.375 |
| Late Night Talking | Harry Styles | 93 | 114.996 |
| About Damn Time | Lizzo | 92 | 108.966 |
| Jimmy Cooks (feat. 21 Savage) | Drake | 92 | 165.921 |
| MIDDLE OF THE NIGHT | Elley Duhé | 91 | 185.727 |
| Until I Found You | Stephen Sanchez | 90 | 101.358 |
| Blinding Lights | The Weeknd | 90 | 171.005 |
| Sweater Weather | The Neighbourhood | 90 | 124.053 |
tiktok_cleaned = (tiktok_charts
.filter(items = ['track_name', 'artist_name', 'track_pop', 'tempo'])
.rename(columns = {'track_name': 'song_title',
'artist_name': 'artist',
'track_pop': 'rank_t'})
.sort_values('rank_t', ascending=False) # higher numbers = more popular
)
tiktok_cleaned song_title artist rank_t tempo
5 Glimpse of Us Joji 97 169.914
1 As It Was Harry Styles 96 173.930
0 Running Up That Hill (A Deal With God) Kate Bush 95 108.375
52 Late Night Talking Harry Styles 93 114.996
260 Jimmy Cooks (feat. 21 Savage) Drake 92 165.921
.. ... ... ... ...
131 What If HOURS 0 126.001
184 Backyard Boy Claire Rosinkranz 0 138.026
130 Slow Down Coopex 0 128.048
42 Without You John De Sohn 0 124.968
200 Right Here Waiting Boostereo 0 115.052
[263 rows x 4 columns]
The billboard_chart was almost the same as the previous two tables. The columsn kept are the columns containing the song title, artist, and rank. The columns were named song_title, artist, and rank_b. The rank_b column indicated the rank of the song, with a lower number indicating a song that achieved a higher rank.
billboard_cleaned <- billboard_charts |>
clean_names() |> #clean the column headers
select(title, artist_s, no) |> #reorder the columns
rename(song_title = title, rank_b = no, artist = artist_s) |>
mutate(song_title = str_replace_all(song_title, '"', '')) #take quotations off song titles
billboard_cleaned--create a cleaned version of the billboard table and select only renamed relevant columns
CREATE OR REPLACE TABLE billboard_cleaned AS
SELECT replace(title, '"', '') AS song_title, "artist(s)" AS artist, "no." AS rank_b
FROM billboard_charts
;
--order from most to least popular using the rank_b column
FROM billboard_cleaned
ORDER BY rank_b| song_title | artist | rank_b |
|---|---|---|
| Heat Waves | Glass Animals | 1 |
| As It Was | Harry Styles | 2 |
| Stay | The Kid Laroi and Justin Bieber | 3 |
| Easy on Me | Adele | 4 |
| Shivers | Ed Sheeran | 5 |
| First Class | Jack Harlow | 6 |
| Big Energy | Latto | 7 |
| Ghost | Justin Bieber | 8 |
| Super Gremlin | Kodak Black | 9 |
| Cold Heart (Pnau Remix) | Elton John and Dua Lipa | 10 |
billboard_cleaned = (billboard_charts
.rename(columns = {'Title': 'song_title',
'Artist(s)': 'artist',
'No.': 'rank_b'})
.sort_values('rank_b') # highest to lowest rank
)
billboard_cleaned['song_title'] = billboard_cleaned['song_title'].str.replace('"', '')
billboard_cleaned rank_b song_title artist
0 1 Heat Waves Glass Animals
1 2 As It Was Harry Styles
2 3 Stay The Kid Laroi and Justin Bieber
3 4 Easy on Me Adele
4 5 Shivers Ed Sheeran
.. ... ... ...
95 96 Flower Shops Ernest featuring Morgan Wallen
96 97 To the Moon Jnr Choi and Sam Tompkins
97 98 Unholy Sam Smith and Kim Petras
98 99 One Mississippi Kane Brown
99 100 Circles Around This Town Maren Morris
[100 rows x 3 columns]
Two challenges stood out to us, both involving the variations in the way that the song metadata is recorded on the charts. The same song can appear on different albums with slgiht variations of the title, or can have a featured artist listed in the title or as an additional artist, or could be a cover of another song, etc. As an example, the song “Enemy” by Imagine Dragons appears twice on the Spotify chart, once on the TikTok chart, and once on the Billboard chart:
spotify_cleaned |>
filter(str_detect(song_title, 'Enemy')) #filter for song titles containing "Enemy"tiktok_cleaned |>
filter(str_detect(song_title, 'Enemy'))billboard_cleaned |>
filter(str_detect(song_title, 'Enemy'))SELECT *
FROM spotify_cleaned
WHERE song_title LIKE '%Enemy%'| song_title | artist | rank_s | tempo |
|---|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons, JID, Arcane, League of Legends | 3 | 77.011 |
| Enemy - From the series Arcane League of Legends | Imagine Dragons, Arcane, League of Legends | 171 | 77.029 |
SELECT *
FROM tiktok_cleaned
WHERE song_title LIKE '%Enemy%'| song_title | artist | rank_t | tempo |
|---|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons | 87 | 77.011 |
SELECT *
FROM billboard_cleaned
WHERE song_title LIKE '%Enemy%'| song_title | artist | rank_b |
|---|---|---|
| Enemy | Imagine Dragons and JID | 15 |
GT(spotify_cleaned.query('song_title.str.contains("Enemy")', engine='python'))| song_title | artist | rank_s | tempo |
|---|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons, JID, Arcane, League of Legends | 3 | 77.011 |
| Enemy - From the series Arcane League of Legends | Imagine Dragons, Arcane, League of Legends | 171 | 77.029 |
GT(tiktok_cleaned.query('song_title.str.contains("Enemy")', engine='python'))| song_title | artist | rank_t | tempo |
|---|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons | 87 | 77.011 |
GT(billboard_cleaned.query('song_title.str.contains("Enemy")', engine='python'))| rank_b | song_title | artist |
|---|---|---|
| 15 | Enemy | Imagine Dragons and JID |
All three charts record the song title differently. The Spotify chart has two different versions of “Enemy” on it: one with JID and one without, so leaving those titles as is is fine. The TikTok and Billboard charts both have the version with JID, but the name of the song and artists listed are different. To handle these synonyms, we created a lookup table and applied it to each dataset. We chose to apply the dictionary to each individual table rather than a combined table so that when we combine the tables, songs with the same title and artists would be combined. The tables for the song titles and artist names were created in Google Sheets and uploaded to RStudio as csv files.
The process in the three languages was essentially the same. We read the csv file and stored the resulting table. Then we merged/joined the lookup table with each chart. The lookup table has two columns: one containing the canonical names and another containing the alternative names.
Read the lookup table.
song_title_lookup <- read_csv("initial_tables/dict_song_titles.csv", show_col_types = FALSE)
song_title_lookupWe used a left_join and joined on the song_title column of the song chart on the alt_name column of the lookup table. This join created a new column called canonical_name on the chart. mutate is used to replace the value in the song_title columns with the name in the canonical_name, if there is one. coalesce takes the first non-null value, so if the canonical_name column is empty, then the song title will not be replaced.
spotify_cleaned <- spotify_cleaned |>
left_join(song_title_lookup, by = join_by(song_title == alt_name)) |>
mutate(song_title = coalesce(canonical_name, song_title)) |> #coalesce takes the first non-null value
select(song_title, artist, rank_s, tempo) #keep only relevant columnstiktok_cleaned <- tiktok_cleaned |>
left_join(song_title_lookup, by = join_by(song_title == alt_name)) |>
mutate(song_title = coalesce(canonical_name, song_title)) |>
select(song_title, artist, rank_t, tempo)billboard_cleaned <- billboard_cleaned |>
left_join(song_title_lookup, by = join_by(song_title == alt_name)) |>
mutate(song_title = coalesce(canonical_name, song_title)) |>
select(song_title, artist, rank_b)Read the look table csv and create a table.
CREATE OR REPLACE TABLE song_title_lookup AS
SELECT *
FROM 'initial_tables/dict_song_titles.csv';We used UPDATE to avoid having to coalesce columns. This code checks if a song listed in the song title column in the chart is the same as a song listed in the alt_name column of the lookup table. If there is a match, then the song title is replaced with the song from the canonical_name column of the lookup table.
UPDATE spotify_cleaned
SET song_title = song_title_lookup.canonical_name
FROM song_title_lookup
WHERE spotify_cleaned.song_title = song_title_lookup.alt_nameUPDATE tiktok_cleaned
SET song_title = song_title_lookup.canonical_name
FROM song_title_lookup
WHERE tiktok_cleaned.song_title = song_title_lookup.alt_nameUPDATE billboard_cleaned
SET song_title = song_title_lookup.canonical_name
FROM song_title_lookup
WHERE billboard_cleaned.song_title = song_title_lookup.alt_nameWe have two different methods in Python here. This method uses .merge and is similar to the R version. The lookup table csv file is read. Then the chart table and the lookup table are left-merged on the song_title and alt_name. This creates two new columns in the chart called canonical_name and alt_name. If there is a value in the alt_name columns, the the value from the canonical_name column replaces the song_title.
song_syn = pd.read_csv('initial_tables/dict_song_titles.csv')
# Spotify
spotify_cleaned = spotify_cleaned.merge(song_syn, left_on = 'song_title', right_on = 'alt_name', how = 'left')
spotify_cleaned['song_title'] = np.where( #if the alt_name column has a value then, replace song_title with canonical_name
spotify_cleaned['alt_name'].notna(),
spotify_cleaned['canonical_name'],
spotify_cleaned['song_title'])
spotify_cleaned = spotify_cleaned.drop(columns = ['canonical_name', 'alt_name']) # remove unnecessary columns
# TikTok
tiktok_cleaned = tiktok_cleaned.merge(song_syn, left_on = 'song_title', right_on = 'alt_name', how = 'left')
tiktok_cleaned['song_title'] = np.where(
tiktok_cleaned['alt_name'].notna(),
tiktok_cleaned['canonical_name'],
tiktok_cleaned['song_title'])
tiktok_cleaned = tiktok_cleaned.drop(columns = ['canonical_name', 'alt_name'])
# Billboard
billboard_cleaned = billboard_cleaned.merge(song_syn, left_on = 'song_title', right_on = 'alt_name', how = 'left')
billboard_cleaned['song_title'] = np.where(
billboard_cleaned['alt_name'].notna(),
billboard_cleaned['canonical_name'],
billboard_cleaned['song_title'])
billboard_cleaned = billboard_cleaned.drop(columns = ['canonical_name', 'alt_name'])The second Python method uses .replace. This method was my initial inclination regarding how to recode the charts, but is not the best appraoch to this problem. This method works be converting the lookup table csv file into a dictionary of keys and values. The alt_name column is set as the index and is mapped to the canonical_name column. to_dict converts the result to a dictionary.
.replace looks for the song title in the keys of the dictionary and replaces the song_title with the canonical name from the dictionary. inplace modifies the original DataFrame directly. Using it spits out a “future warning” because there are planned changes for inplace.
# import the table and convert to a dictionary using to.dict()
song_title_dict = pd.read_csv('initial_tables/dict_song_titles.csv', header=0).set_index('alt_name')['canonical_name'].to_dict()
# replace the song_title with the canonical_name
spotify_cleaned['song_title'].replace(song_title_dict, inplace = True)
tiktok_cleaned['song_title'].replace(song_title_dict, inplace = True)
billboard_cleaned['song_title'].replace(song_title_dict, inplace = True)The we did the exact same thing but for the artist names. The lookup table in this instance had two columns: canonical_name and alt_name. The process was essentially the same as changing the song titles.
dict_artist <- read_csv("initial_tables/dict_artist.csv", show_col_types = FALSE)
dict_artistspotify_cleaned <- spotify_cleaned |>
left_join(dict_artist, by = join_by(artist == alt_name)) |>
mutate(song_title = coalesce(canonical_name, song_title)) |>
select(song_title, artist, rank_s, tempo)tiktok_cleaned <- tiktok_cleaned |>
left_join(dict_artist, by = join_by(artist == alt_name)) |>
mutate(song_title = coalesce(canonical_name, song_title)) |>
select(song_title, artist, rank_t, tempo)billboard_cleaned <- billboard_cleaned |>
left_join(dict_artist, by = join_by(artist == alt_name)) |>
mutate(song_title = coalesce(canonical_name, song_title)) |>
select(song_title, artist, rank_b)CREATE TABLE dict_artist AS
SELECT * FROM 'initial_tables/dict_artist.csv';UPDATE spotify_cleaned
SET artist = dict_artist.canonical_name
FROM dict_artist
WHERE spotify_cleaned.artist = dict_artist.alt_nameUPDATE tiktok_cleaned
SET artist = dict_artist.canonical_name
FROM dict_artist
WHERE tiktok_cleaned.artist = dict_artist.alt_nameUPDATE billboard_cleaned
SET artist = dict_artist.canonical_name
FROM dict_artist
WHERE billboard_cleaned.artist = dict_artist.alt_namedict_artist = pd.read_csv('initial_tables/dict_artist.csv')
# Spotify
spotify_cleaned = spotify_cleaned.merge(dict_artist, left_on = 'artist', right_on = 'alt_name', how = 'left')
spotify_cleaned['artist'] = np.where( #if the alt_name column has a value then, replace artist with canonical_name
spotify_cleaned['alt_name'].notna(),
spotify_cleaned['canonical_name'],
spotify_cleaned['artist'])
spotify_cleaned = spotify_cleaned.drop(columns = ['canonical_name', 'alt_name'])
# TikTok
tiktok_cleaned = tiktok_cleaned.merge(dict_artist, left_on = 'artist', right_on = 'alt_name', how = 'left')
tiktok_cleaned['artist'] = np.where(
tiktok_cleaned['alt_name'].notna(),
tiktok_cleaned['canonical_name'],
tiktok_cleaned['artist'])
tiktok_cleaned = tiktok_cleaned.drop(columns = ['canonical_name', 'alt_name'])
# Billboard
billboard_cleaned = billboard_cleaned.merge(dict_artist, left_on = 'artist', right_on = 'alt_name', how = 'left')
billboard_cleaned['artist'] = np.where(
billboard_cleaned['alt_name'].notna(),
billboard_cleaned['canonical_name'],
billboard_cleaned['artist'])
billboard_cleaned = billboard_cleaned.drop(columns = ['canonical_name', 'alt_name'])dict_artist = pd.read_csv('initial_tables/dict_artist.csv', header=0).set_index('alt_name')['canonical_name'].to_dict()
spotify_cleaned['artist'].replace(dict_artist, inplace = True)
tiktok_cleaned['artist'].replace(dict_artist, inplace = True)
billboard_cleaned['artist'].replace(dict_artist, inplace = True)We did not print the tables above as they would have looked nearly identical to the original tables. Only a few songs out of about 900 songs were changed, so the changes would be difficult to see if the entire table was printed. Here is the same example from above using “Enemy.”
spotify_cleaned |>
filter(str_detect(song_title, 'Enemy')) #filter for song titles containing "Enemy"tiktok_cleaned |>
filter(str_detect(song_title, 'Enemy'))billboard_cleaned |>
filter(str_detect(song_title, 'Enemy'))SELECT *
FROM spotify_cleaned
WHERE song_title LIKE '%Enemy%'| song_title | artist | rank_s | tempo |
|---|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons, JID, Arcane, League of Legends | 3 | 77.011 |
| Enemy - From the series Arcane League of Legends | Imagine Dragons, Arcane, League of Legends | 171 | 77.029 |
SELECT *
FROM tiktok_cleaned
WHERE song_title LIKE '%Enemy%'| song_title | artist | rank_t | tempo |
|---|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons | 87 | 77.011 |
SELECT *
FROM billboard_cleaned
WHERE song_title LIKE '%Enemy%'| song_title | artist | rank_b |
|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons and JID | 15 |
GT(spotify_cleaned.query('song_title.str.contains("Enemy")', engine='python'))| song_title | artist | rank_s | tempo |
|---|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons, JID, Arcane, League of Legends | 3 | 77.011 |
| Enemy - From the series Arcane League of Legends | Imagine Dragons, Arcane, League of Legends | 171 | 77.029 |
GT(tiktok_cleaned.query('song_title.str.contains("Enemy")', engine='python'))| song_title | artist | rank_t | tempo |
|---|---|---|---|
| Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons | 87 | 77.011 |
GT(billboard_cleaned.query('song_title.str.contains("Enemy")', engine='python'))| rank_b | song_title | artist |
|---|---|---|
| 15 | Enemy (with JID) - from the series Arcane League of Legends | Imagine Dragons and JID |